NSF PAR Search | NSF Public Access Repository

Ensemble automated approaches for producing high‐quality herbarium digital records

https://doi.org/10.1002/aps3.11623

Guralnick, Robert P; LaFrance, Raphael; Allen, Julie M; Denslow, Michael W (November 2024, Applications in Plant Sciences)

Abstract. Premise: One of the slowest steps in digitizing natural history collections is converting labels associated with specimens into a digital data record usable for collections management and research. Here, we address how herbarium specimen labels can be converted into digital data records via extraction into standardized Darwin Core fields. Methods: We first showcase the development of a rule‐based approach and compare outcomes with a large language model–based approach, in particular ChatGPT4. We next quantified omission and commission error rates across target fields for a set of labels transcribed using optical character recognition (OCR) for both approaches. For example, we find that ChatGPT4 often creates field names that are not Darwin Core compliant while rule‐based approaches often have high commission error rates. Results: Our results suggest that these approaches each have different strengths and limitations. We therefore developed an ensemble approach that leverages the strengths of each individual method and documented that ensembling strongly reduced overall information extraction errors. Discussion: This work shows that an ensemble approach has particular value for creating high‐quality digital data records, even for complicated label content. While human validation is still needed to ensure the best possible quality, automated approaches can speed digitization of herbarium specimen labels and are likely to be broadly usable for all natural history collection types.
Full Text Available
FloraTraiter: Automated parsing of traits from descriptive biodiversity literature

https://doi.org/10.1002/aps3.11563

Folk, Ryan A; Guralnick, Robert P; LaFrance, Raphael T (January 2024, Applications in Plant Sciences)

Abstract. Premise: Plant trait data are essential for quantifying biodiversity and function across Earth, but these data are challenging to acquire for large studies. Diverse strategies are needed, including the liberation of heritage data locked within specialist literature such as floras and taxonomic monographs. Here we report FloraTraiter, a novel approach using rule‐based natural language processing (NLP) to parse computable trait data from biodiversity literature. Methods: FloraTraiter was implemented through collaborative work between programmers and botanical experts and customized for both online floras and scanned literature. We report a strategy spanning optical character recognition, recognition of taxa, iterative building of traits, and establishing linkages among all of these, as well as curational tools and code for turning these results into standard morphological matrices. Results: Over 95% of treatment content was successfully parsed for traits with <1% error. Data for more than 700 taxa are reported, including a demonstration of common downstream uses. Conclusions: We identify strategies, applications, tips, and challenges that we hope will facilitate future similar efforts to produce large open‐source trait data sets for broad community reuse. Largely automated tools like FloraTraiter will be an important addition to the toolkit for assembling trait data at scale.
Full Text Available
Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks

https://doi.org/10.1002/aps3.11560

Guralnick, Robert; LaFrance, Raphael; Denslow, Michael; Blickhan, Samantha; Bouslog, Mark; Miller, Sean; Yost, Jenn; Best, Jason; Paul, Deborah L; Ellwood, Elizabeth; et al (January 2024, Applications in Plant Sciences)

Abstract. Premise: Among the slowest steps in the digitization of natural history collections is converting imaged labels into digital text. We present here a working solution to overcome this long‐recognized efficiency bottleneck that leverages synergies between community science efforts and machine learning approaches. Methods: We present two new semi‐automated services. The first detects and classifies typewritten, handwritten, or mixed labels from herbarium sheets. The second uses a workflow tuned for specimen labels to label text using optical character recognition (OCR). The label finder and classifier was built via humans‐in‐the‐loop processes that utilize the community science Notes from Nature platform to develop training and validation data sets to feed into a machine learning pipeline. Results: Our results showcase a >93% success rate for finding and classifying main labels. The OCR pipeline optimizes pre‐processing, multiple OCR engines, and post‐processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields >4‐fold reductions in errors compared to off‐the‐shelf open‐source solutions. The OCR workflow also allows human validation using a custom Notes from Nature tool. Discussion: Our work showcases a usable set of tools for herbarium digitization including a custom‐built web application that is freely accessible. Further work to better integrate these services into existing toolkits can support broad community use.
Full Text Available
Demographic consequences of phenological asynchrony for North American songbirds

https://doi.org/10.1073/pnas.2221961120

Youngflesh, Casey; Montgomery, Graham A.; Saracco, James F.; Miller, David A.; Guralnick, Robert P.; Hurlbert, Allen H.; Siegel, Rodney B.; LaFrance, Raphael; Tingley, Morgan W. (July 2023, Proceedings of the National Academy of Sciences)

Changes in phenology in response to ongoing climate change have been observed in numerous taxa around the world. Differing rates of phenological shifts across trophic levels have led to concerns that ecological interactions may become increasingly decoupled in time, with potential negative consequences for populations. Despite widespread evidence of phenological change and a broad body of supporting theory, large-scale multitaxa evidence for demographic consequences of phenological asynchrony remains elusive. Using data from a continental-scale bird-banding program, we assess the impact of phenological dynamics on avian breeding productivity in 41 species of migratory and resident North American birds breeding in and around forested areas. We find strong evidence for a phenological optimum where breeding productivity decreases in years with both particularly early or late phenology and when breeding occurs early or late relative to local vegetation phenology. Moreover, we demonstrate that landbird breeding phenology did not keep pace with shifts in the timing of vegetation green-up over a recent 18-y period, even though avian breeding phenology has tracked green-up with greater sensitivity than arrival for migratory species. Species whose breeding phenology more closely tracked green-up tend to migrate shorter distances (or are resident over the entire year) and breed earlier in the season. These results showcase the broadest-scale evidence yet of the demographic impacts of phenological change. Future climate change–associated phenological shifts will likely result in a decrease in breeding productivity for most species, given that bird breeding phenology is failing to keep pace with climate change.
Full Text Available
Extending mammal specimens with their essential phenotypic traits

https://doi.org/10.1093/jmammal/gyaf062

McLean, Bryan S; Bloom, David; Davis, Edward B; Guralnick, Robert P; Santana, Sharlene E; Allen, Julie M; Amarilla-Stevens, Heidi; Bell, Kayce C; Blackburn, David C; Bradley, Jeffrey E; et al (July 2025, Journal of Mammalogy)

Abstract. Natural history collections are repositories of biodiversity specimens that provide critical infrastructure for studies of mammals. Over the past 3 decades, digitization of collections has opened up the temporal and spatial properties of specimens, stimulating new data sharing, use, and training across the biodiversity sciences. These digital records are the cornerstones of an “extended specimen network,” in which the diverse data derived from specimens become digital, linked, and openly accessible for science and policy. However, still missing from most digital occurrences of mammals are their morphological, reproductive, and life-history traits. Unlocking this information will advance mammalogy, establish richer faunal baselines in an era of rapid environmental change, and contextualize other types of specimen-derived information toward new knowledge and discovery. Here, we present the Ranges Digitization Network (Ranges), a community effort to digitize specimen-level traits from all terrestrial mammals of western North America, append them to digital records, publish them openly in community repositories, and make them interoperable with complementary data streams. Ranges is a consortium of 23 institutions with an initial focus on non-marine mammal species (both native and introduced) occurring in western Canada, the western United States, and Mexico. The project will establish trait data standards and informatics workflows that can be extended to other regions, taxa, and traits. Reconnecting mammalogists, museum professionals, and researchers for a new era of collections digitization will catalyze advances in mammalogy and create a community-curated trait resource for training and engagement with global conservation initiatives.
Free, publicly-accessible full text available July 26, 2026
High‐throughput methods for efficiently building massive phylogenies from natural history collections

https://doi.org/10.1002/aps3.11410

Folk, Ryan A.; Kates, Heather R.; LaFrance, Raphael; Soltis, Douglas E.; Soltis, Pamela S.; Guralnick, Robert P. (February 2021, Applications in Plant Sciences)

Full Text Available
The Effects of Herbarium Specimen Characteristics on Short-Read NGS Sequencing Success in Nearly 8000 Specimens: Old, Degraded Samples Have Lower DNA Yields but Consistent Sequencing Success

https://doi.org/10.3389/fpls.2021.669064

Kates, Heather R.; Doby, Joshua R.; Siniscalchi, Carol M.; LaFrance, Raphael; Soltis, Douglas E.; Soltis, Pamela S.; Guralnick, Robert P.; Folk, Ryan A. (June 2021, Frontiers in Plant Science)

Phylogenetic datasets are now commonly generated using short-read sequencing technologies unhampered by degraded DNA, such as that often extracted from herbarium specimens. The compatibility of these methods with herbarium specimens has precipitated an increase in broad sampling of herbarium specimens for inclusion in phylogenetic studies. Understanding which sample characteristics are predictive of sequencing success can guide researchers in the selection of tissues and specimens most likely to yield good results. Multiple recent studies have considered the relationship between sample characteristics and DNA yield and sequence capture success. Here we report an analysis of the relationship between sample characteristics and sequencing success for nearly 8,000 herbarium specimens. This study, the largest of its kind, is also the first to include a measure of specimen quality (“greenness”) as a predictor of DNA sequencing success. We found that taxonomic group and source herbarium are strong predictors of both DNA yield and sequencing success and that the most important specimen characteristics for predicting success differ for DNA yield and sequencing: greenness was the strongest predictor of DNA yield, and age was the strongest predictor of proportion-on-target reads recovered. Surprisingly, the relationship between age and proportion-on-target reads is the inverse of expectations; older specimens performed slightly better in our capture-based protocols. We also found that DNA yield itself is not a strong predictor of sequencing success. Most literature on DNA sequencing from herbarium specimens considers specimen selection for optimal DNA extraction success, which we find to be an inappropriate metric for predicting success using next-generation sequencing technologies.
Full Text Available
Migratory strategy drives species-level variation in bird sensitivity to vegetation green-up

https://doi.org/10.1038/s41559-021-01442-y

Youngflesh, Casey; Socolar, Jacob; Amaral, Bruna R.; Arab, Ali; Guralnick, Robert P.; Hurlbert, Allen H.; LaFrance, Raphael; Mayor, Stephen J.; Miller, David A.; Tingley, Morgan W. (July 2021, Nature Ecology & Evolution)
Full Text Available
A solution to the challenges of interdisciplinary aggregation and use of specimen-level trait data

https://doi.org/10.1016/j.isci.2022.105101

Balk, Meghan A.; Deck, John; Emery, Kitty F.; Walls, Ramona L.; Reuter, Dana; LaFrance, Raphael; Arroyo-Cabrales, Joaquín; Barrett, Paul; Blois, Jessica; Boileau, Arianne; et al (October 2022, iScience)

Full Text Available
